[TUTORIALS] Add multicta tutorial#9654

Open
lezcano wants to merge 2 commits into main from tutorial_multicta

@lezcano lezcano commented Mar 5, 2026

We go over all the bits and pieces necessary to write multi-CTA kernels
in Gluon. We finish with a recipe to get SOTA perf on a dense matmul.

We also change the benchmarking tool used in tutorial 8, since the
CUDA-graph bench does not zero out the L2 cache, and update the cuBLAS
numbers, as I was not able to reproduce the ones listed there (cuBLAS
has probably been optimised in a newer version).
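
For context on the L2 point, here is a minimal host-side sketch of a cold-cache timing loop: the cache is flushed before every timed run, and the flush stays outside the timed region. The names `bench_cold`, `flush_scratch` and `scratch` are hypothetical and not taken from the tutorial. The buffer size is an arbitrary assumption. A real GPU benchmark would zero a device buffer larger than the L2 and time with device-side events rather than a host timer.

```python
import time
from statistics import median


def bench_cold(fn, flush, reps=20):
    """Median wall-clock time of fn() in milliseconds, flushing before each run."""
    times_ms = []
    for _ in range(reps):
        flush()  # evict cached data; deliberately outside the timed region
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1e3)
    return median(times_ms)


# Host-side stand-in for "zero a buffer larger than the L2".
# The 8 MiB size is an assumption for illustration, not a real L2 capacity.
scratch = bytearray(8 * 1024 * 1024)


def flush_scratch():
    # Overwrite the whole buffer with zeros so that it displaces older data.
    scratch[:] = bytes(len(scratch))


if __name__ == "__main__":
    work = lambda: sum(range(100_000))
    print(f"median: {bench_cold(work, flush_scratch, reps=5):.3f} ms")
```

A loop that replays the kernel back to back without the flush step measures warm-cache behaviour instead, which can make memory-bound kernels look faster than they are in practice.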

@lezcano lezcano requested a review from ptillet as a code owner March 5, 2026 13:21

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32682077d6


if __name__ == "__main__" and not is_blackwell():
    raise RuntimeError("This tutorial requires a Blackwell NVIDIA GPU")

Hmm, why do you still have the is_hopper_or_newer function then?
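
To make the question above concrete, here is a minimal sketch of how the two guards might differ. It assumes they compare the major compute capability against 9 (Hopper) and 10 (Blackwell). Those thresholds, and passing the capability as an argument, are assumptions for illustration; the tutorial's helpers take no arguments and presumably query the active device.

```python
def is_hopper_or_newer(capability):
    """True for an assumed major compute capability of 9 or higher."""
    major, _minor = capability
    return major >= 9


def is_blackwell(capability):
    """True only for an assumed major compute capability of exactly 10."""
    major, _minor = capability
    return major == 10


if __name__ == "__main__":
    # Capability tuples below are assumed example values.
    for name, cap in [("Hopper", (9, 0)), ("Blackwell", (10, 0))]:
        print(name, is_hopper_or_newer(cap), is_blackwell(cap))
```

Under these assumptions a Hopper device passes `is_hopper_or_newer` but fails `is_blackwell`, so once the entry point requires Blackwell, the broader check no longer changes which devices can run the tutorial.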

# `two_ctas=True`, because only the lead CTA waits before issuing the MMA.
#
# Once one `tcgen05_mma` in a kernel uses 2CTA mode, all of the `tcgen05_mma`
# instructions in that kernel must use 2CTA mode.

Why this requirement?
